
SDF-based dataset support#32

Open
sfluegel05 wants to merge 27 commits into dev from feature/sdf-support

Conversation

@sfluegel05
Contributor

Add support for the SDF-based dataset (ChEB-AI/python-chebai#147). This mostly includes:

  • allowing Chem.Mol objects as input to _read_data functions
  • allowing Chem.Mol as input to read_property (also: the _read_data function now returns the augmented molecule dictionary which gets passed to read_property, avoiding a complete recalculation for each property)
  • using a standard sanitize_molecule function to ensure consistent SMILES parsing
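As a rough sketch of the `str | Chem.Mol` dispatch described above (plain Python; `Mol` and `mol_from_smiles` are stand-ins for `rdkit.Chem.Mol` and a SMILES parser, not the PR's actual API):

```python
# Sketch only: `Mol` stands in for rdkit.Chem.Mol, and `mol_from_smiles`
# for a SMILES parser such as Chem.MolFromSmiles plus sanitization.

class Mol:
    """Stand-in for an already-parsed RDKit molecule (e.g. read from an SDF)."""
    def __init__(self, smiles):
        self.smiles = smiles

def mol_from_smiles(smiles):
    """Placeholder parser; a real version would parse and sanitize the SMILES."""
    if not smiles:
        raise ValueError("empty SMILES")
    return Mol(smiles)

def read_data(raw_data):
    """Accept either a SMILES string or a pre-built molecule object."""
    try:
        # strings are parsed; molecule objects are passed through unchanged
        mol = mol_from_smiles(raw_data) if isinstance(raw_data, str) else raw_data
    except ValueError:
        return None  # unparsable input is skipped, not fatal
    return mol
```

With this shape, an SDF-based dataset can hand already-parsed molecules straight to the reader while SMILES-based datasets keep working unchanged.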

One issue I came across while testing this: The AugAtomNumHs property (or any property inheriting from FrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here? Should I first build a dataset with the non-augmented version of each property and then use those to create an augmented dataset?

The above exception was the direct cause of the following exception:

    return self.compute_fn(*args)
  File "/home/staff/s/sifluegel/python-chebai/chebai/cli.py", line 46, in call_data_methods
    data.setup()
  File "/home/staff/s/sifluegel/python-chebai/chebai/preprocessing/datasets/base.py", line 524, in setup
    self._after_setup(**kwargs)
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/datasets/chebi.py", line 199, in _after_setup
    self._setup_properties()
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/datasets/chebi.py", line 168, in _setup_properties
    property.on_finish()
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/properties/base.py", line 206, in on_finish
    raise ValueError(
ValueError: AugAtomNumHs attempted to add new tokens to a frozen encoder at /home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/bin/AtomNumHs/indices_one_hot.txt

@aditya0by0
Member

aditya0by0 commented Feb 26, 2026

One issue I came across while testing this: The AugAtomNumHs property (or any property inheriting from FrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here?

Yes, the approach below is the intended/correct one. FrozenPropertyAlias was designed to reuse the non-augmented versions of properties when working with augmented data. Because of this design, alias properties are not allowed to create new tokens; they can only use tokens that already exist. This restriction is intentionally enforced in the logic of FrozenPropertyAlias.

Should I first build a dataset with the non-augmented version of each property and then use those to create an augmented dataset?

class FrozenPropertyAlias(MolecularProperty, ABC):
    """
    Wrapper base class for augmented graph properties that reuse existing molecular properties.

    This allows an augmented property class (with an 'Aug' prefix in its name) to:
    - Reuse the encoder and index files of the base property by removing the 'Aug' prefix from its name.
    - Prevent adding new tokens to the encoder cache by freezing it (using MappingProxyType).

    Usage:
        Inherit from FrozenPropertyAlias and the desired base molecular property class,
        and name the class with an 'Aug' prefix (e.g., 'AugAtomType').

    Example:
        ```python
        class AugAtomType(FrozenPropertyAlias, AtomType): ...
        ```
    """
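The freezing mechanism the docstring mentions can be illustrated in a few lines of plain Python (a minimal sketch, not the package's actual classes): wrapping the base property's token cache in `MappingProxyType` keeps lookups of known tokens working while any attempt to add an unseen token raises.

```python
from types import MappingProxyType

base_cache = {"0": 0, "1": 1, "2": 2}        # token -> index of the base property
frozen_cache = MappingProxyType(base_cache)  # read-only view used by the alias

def encode(token, cache):
    """Return the index for `token`, adding it to the cache if unseen."""
    if token not in cache:
        cache[token] = len(cache)  # raises TypeError on a read-only proxy
    return cache[token]
```

This is consistent with the ValueError in the traceback above: the augmented property encountered a token that was never recorded in the base property's index file, and the frozen cache refused to grow.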

@sfluegel05
Contributor Author

I have created a regular dataset which has updated the tokens.txt files. I also changed the edge_dim parameter of the model accordingly. Now I am able to create an augmented dataset.

However, I get an error when trying to load the data. _merge_props_into_base fails with this error:

ValueError: Error assigning property 'AtomNodeLevel' values to node features: The expanded size of the tensor (24) must match the existing size (26) at non-singleton dimension 0.  Target sizes: [24, 3].  Tensor sizes: [26, 3]
Property values shape: torch.Size([26, 3]), expected (num_nodes, 3)
Node feature matrix shape: torch.Size([24, 203])

when trying to execute this statement (try/except block added for debugging):

if isinstance(property, AllNodeTypeProperty):
    try:
        x[:, atom_offset : atom_offset + enc_len] = property_values
    except Exception as e:
        raise ValueError(
            f"Error assigning property '{property.name}' values to node features: {e}\n"
            f"Property values shape: {property_values.shape}, expected (num_nodes, {enc_len})\n"
            f"Node feature matrix shape: {x.shape}"
        )

It looks like the number of nodes in the dataset (here: 24) is not the same as the number of nodes for which AtomNodeLevel provides values (here: 26). I tried re-generating the dataset but still get the same results. @aditya0by0 Do you have an idea why this could happen?
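For context, the failing assignment reduces to a simple invariant: the property must provide exactly one row per node in the feature matrix. A plain-Python sketch of that check (illustrative helper, not the project's code):

```python
# Mirrors x[:, atom_offset : atom_offset + enc_len] = property_values:
# the assignment only works when both sides agree on the number of nodes.

def assign_property(x, property_values, atom_offset, enc_len):
    """Write per-node property rows into the node feature matrix `x`."""
    num_nodes = len(x)
    if len(property_values) != num_nodes:
        raise ValueError(
            f"Property provides values for {len(property_values)} nodes, "
            f"but the feature matrix has {num_nodes} nodes."
        )
    for i, row in enumerate(property_values):
        x[i][atom_offset:atom_offset + enc_len] = row
    return x
```

In the error above the two sides disagree (24 nodes in the feature matrix vs. 26 property rows), so the assignment fails before any values are written.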

@sfluegel05 sfluegel05 marked this pull request as ready for review March 6, 2026 14:37
@sfluegel05
Contributor Author

I also changed the linting to use only Ruff (mirroring the main repository: ChEB-AI/python-chebai#145).


Copilot AI left a comment


Pull request overview

Adds SDF-based dataset support across the graph preprocessing pipeline by accepting RDKit Chem.Mol inputs, reusing augmented molecule structures for property computation, and standardizing molecule sanitization.

Changes:

  • Extend readers/datasets to accept str | Chem.Mol inputs and propagate augmented molecule dictionaries to read_property.
  • Standardize RDKit SMILES parsing via sanitize=False + shared _sanitize_molecule.
  • Update property token bins / model config edge dimensions and migrate linting to pre-commit + Ruff.

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| configs/model/res_aug_amgpool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/res_aug_aapool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/gat_aug_amgpool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/gat_aug_aapool.yml | Updates edge_dim to reflect additional bond feature(s). |
| chebai_graph/preprocessing/reader/reader.py | Adds RDKit-mol support and sanitization to readers; changes _read_data return shape for property reader. |
| chebai_graph/preprocessing/reader/augmented_reader.py | Allows Chem.Mol inputs and returns augmented molecule dict alongside GeomData; updates property reading to accept dict. |
| chebai_graph/preprocessing/property_encoder.py | Makes IndexEncoder.cache robustly mutable when adding new tokens. |
| chebai_graph/preprocessing/properties/properties.py | Adds validation around mol and SMILES conversion before descriptor generation. |
| chebai_graph/preprocessing/properties/base.py | Formatting-only assertion message restructuring. |
| chebai_graph/preprocessing/fg_detection/fg_aware_rule_based.py | Uses shared sanitization when rebuilding fragment mols from SMILES. |
| chebai_graph/preprocessing/datasets/chebi.py | Avoids redundant augmentation/property recomputation; supports tuple-returning _read_data for downstream property reading. |
| chebai_graph/preprocessing/bin/NumAtomBonds/indices_one_hot.txt | Extends token set for NumAtomBonds. |
| chebai_graph/preprocessing/bin/BondType/indices_one_hot.txt | Extends token set for BondType (adds UNSPECIFIED). |
| chebai_graph/preprocessing/bin/AtomNumHs/indices_one_hot.txt | Extends token set for AtomNumHs. |
| chebai_graph/preprocessing/bin/AtomFunctionalGroup/indices_one_hot.txt | Extends token set for AtomFunctionalGroup. |
| chebai_graph/models/dynamic_gni.py | Formatting-only assertion message restructuring. |
| .pre-commit-config.yaml | Migrates lint/format hooks to Ruff + updates hook set. |
| .github/workflows/pre-commit.yml | Adds CI workflow to run pre-commit. |
| .github/workflows/lint.yml | Removes prior standalone lint workflow. |
Comments suppressed due to low confidence (1)

chebai_graph/preprocessing/reader/reader.py:170

  • GraphReader._read_data now calls _smiles_to_mol() which returns an RDKit Chem.Mol, but the method still asserts isinstance(mol, nx.Graph) and then iterates mol.nodes/mol.edges. This will fail at runtime for every input. Either restore the previous pysmiles/NetworkX parsing path, or refactor _read_data to build the NetworkX graph from the RDKit molecule (and adjust types/docs accordingly).
        # raw_data is a SMILES string
        try:
            mol = (
                self._smiles_to_mol(raw_data) if isinstance(raw_data, str) else raw_data
            )
        except ValueError:
            return None
        assert isinstance(mol, nx.Graph)
        d: dict[int, int] = {}


Comment on lines +135 to 147
# augment molecule graph if possible (this would also happen for the properties if needed, but this avoids redundancy)
if isinstance(self.reader, _AugmentorReader):
    returned_results = []
    for mol in features:
        try:
            r = self.reader._create_augmented_graph(mol)
        except Exception:
            r = None
        returned_results.append(r)
    mols = [
        augmented_mol[1] if augmented_mol is not None else None
        for augmented_mol in returned_results
    ]

Copilot AI Mar 8, 2026


In _setup_properties, when self.reader is an _AugmentorReader you call self.reader._create_augmented_graph(mol) for each entry in features. For existing SMILES-based datasets features is typically a str, but _create_augmented_graph expects an RDKit Chem.Mol and will fail (caught and turned into None), resulting in mols being all None and property extraction being skipped. Consider using self.reader.read_property(...) directly, or convert SMILES to Chem.Mol via the reader’s _smiles_to_mol before calling _create_augmented_graph.

Comment on lines +310 to +315
if isinstance(row["features"], tuple):
    # ignore additional returned data from _read_data (e.g. augmented molecule dict)
    geom_data, _ = row["features"]
else:
    geom_data = row["features"]

Copilot AI Mar 8, 2026


GraphPropertiesMixIn._merge_props_into_base now supports row["features"] being a tuple, but other mixins in this same module (e.g., the augmented-graph property mixins) still treat row["features"] as a GeomData and read masks from it. If the processed dataset stores (GeomData, augmented_molecule) (as _AugmentorReader._read_data now returns), those asserts/mask accesses will break at load time. Consider normalizing row["features"] to GeomData earlier (or updating the remaining call sites to unwrap tuples consistently).
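One way to implement the normalization suggested here (hedged sketch; `GeomData` is a stand-in class and `normalize_features` a hypothetical helper name, not the project's API):

```python
# Unwrap tuple-shaped row["features"] once, up front, so every downstream
# mixin always sees the bare graph-data object.

class GeomData:
    """Stand-in for the real graph data type (e.g. torch_geometric Data)."""
    pass

def normalize_features(row):
    """Return `row` with row['features'] reduced to the bare GeomData object."""
    feats = row.get("features")
    if isinstance(feats, tuple):
        # (GeomData, augmented_molecule) -> keep the graph, stash the molecule
        row["features"], row["augmented_mol"] = feats
    return row
```

Applying this once when rows are loaded would let the remaining mixins keep their existing `row["features"]` accesses unchanged.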

Comment on lines +852 to +857


if __name__ == "__main__":
    dataset = ChEBI25_WFGE_WGN_AsPerNodeType(chebi_version=248, subset="3_STAR")
    dataset.prepare_data()
    dataset.setup()

Copilot AI Mar 8, 2026


This file appears to include an executable __main__ block that runs dataset preparation/setup. For a library/module file this is easy to trigger accidentally and can lead to long-running side effects when users run the module directly. Consider removing it from the PR, or moving this into a dedicated CLI/script (or a reproducible notebook) if it’s meant as a debugging entrypoint.

Suggested change (delete the block):

    if __name__ == "__main__":
        dataset = ChEBI25_WFGE_WGN_AsPerNodeType(chebi_version=248, subset="3_STAR")
        dataset.prepare_data()
        dataset.setup()

Comment on lines +12 to +19
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
  with:
    python-version: '3.10'

- name: Run pre-commit
  uses: pre-commit/action@v3.0.1

Copilot AI Mar 8, 2026


The workflow YAML is invalid due to indentation: steps: must contain a properly-indented list (the - uses: entries should be indented under steps:). As written, GitHub Actions will fail to parse this workflow.

Suggested change (indent the step entries under `steps:`):

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Run pre-commit
        uses: pre-commit/action@v3.0.1

hooks:
  - id: black
  - id: black-jupyter # for formatting jupyter-notebook
# Use `pre-commit autoupdate` to update all the hook.

Copilot AI Mar 8, 2026


Minor grammar: "update all the hook" should be "update all the hooks".

Suggested change
# Use `pre-commit autoupdate` to update all the hook.
# Use `pre-commit autoupdate` to update all the hooks.

raw_data (str | Chem.Mol): SMILES string or RDKit molecule object.

Returns:
GeomData | None: Graph data object or None if molecule parsing failed.

Copilot AI Mar 8, 2026


The _read_data docstring still documents returning GeomData | None, but the function signature and implementation now return a tuple (GeomData, mol). Please update the docstring return type/description to match the new behavior to avoid misleading callers.

Suggested change
GeomData | None: Graph data object or None if molecule parsing failed.
tuple[GeomData, Chem.Mol] | None: A tuple of (graph data object, RDKit molecule)
or None if molecule parsing failed.

Comment on lines +1917 to 1921
try:
    frag_mol = _sanitize_molecule(frag_mol)
except Exception:
    pass
frag_rings = frag_mol.GetRingInfo().AtomRings()

Copilot AI Mar 8, 2026


frag_mol can be None when Chem.MolFromSmiles(frag, sanitize=False) fails, or it can become None if sanitization fails. In both cases the next line calls frag_mol.GetRingInfo(), which will raise an AttributeError. Add a None check (and decide whether to mark is_ring_fg as False or skip the fragment) before accessing ring info.

Suggested change

Before:

    try:
        frag_mol = _sanitize_molecule(frag_mol)
    except Exception:
        pass
    frag_rings = frag_mol.GetRingInfo().AtomRings()

After:

    if frag_mol is not None:
        try:
            frag_mol = _sanitize_molecule(frag_mol)
        except Exception:
            # If sanitization fails, fall back to the original fragment molecule
            pass
    # If molecule creation or sanitization failed, treat as having no rings
    if frag_mol is None:
        frag_rings = ()
    else:
        try:
            frag_rings = frag_mol.GetRingInfo().AtomRings()
        except Exception:
            frag_rings = ()
Comment on lines +222 to 225
if isinstance(result["features"], tuple):
    result["features"], raw_data = result["features"]
if result is None or result["features"] is None:
    return None

Copilot AI Mar 8, 2026


_preprocess_smiles_for_pred checks isinstance(result["features"], tuple) before verifying that result is not None. Since the next line explicitly handles result is None, this can currently raise a TypeError/AttributeError when to_data(...) returns None. Move the result is None / result["features"] is None guard before dereferencing result["features"].

Suggested change

Before:

    if isinstance(result["features"], tuple):
        result["features"], raw_data = result["features"]
    if result is None or result["features"] is None:
        return None

After:

    if result is None or result["features"] is None:
        return None
    if isinstance(result["features"], tuple):
        result["features"], raw_data = result["features"]
